Abstract

In the Bayesian Reinforcement Learning (BRL) setting, agents try to maximise the collected
rewards while interacting with their environment, using some prior knowledge
that is accessed beforehand. Many BRL algorithms have already been proposed, but even
though a few toy examples exist in the literature, there are still no extensive or rigorous
benchmarks to compare them. This paper addresses this problem, and provides a new
BRL comparison methodology along with the corresponding open source library. In this
methodology, a comparison criterion that measures the performance of algorithms on large
sets of Markov Decision Processes (MDPs) drawn from some probability distributions is
defined. In order to enable the comparison of non-anytime algorithms, our methodology
also includes a detailed analysis of the computation time requirement of each algorithm.

Our library is released with all source code and documentation: it includes three test problems,
each of which has two different prior distributions, and seven state-of-the-art RL
algorithms. Finally, our library is illustrated by comparing all the available algorithms and
the results are discussed.

Keywords: Bayesian Reinforcement Learning, Benchmarking, BBRL library, Offline
Learning, Reinforcement Learning

1. Introduction 

Reinforcement Learning (RL) agents aim to maximise collected rewards by interacting over 
a certain period of time in initially unknown environments. Actions that yield the highest
performance according to the current knowledge of the environment and those that
maximise the gathering of new knowledge on the environment may not be the same. This 
is the dilemma known as Exploration/Exploitation (E/E). In such a context, using prior 
knowledge of the environment is extremely valuable, since it can help guide the decision-making
process in order to reduce the time spent on exploration. Model-based Bayesian
Reinforcement Learning (BRL) (Dearden et al. (1999); Strens (2000)) specifically targets
RL problems for which such a prior knowledge is encoded in the form of a probability 
distribution (the “prior”) over possible models of the environment. As the agent interacts 
with the actual model, this probability distribution is updated according to the Bayes rule 
into what is known as “posterior distribution”. The BRL process may be divided into two 
learning phases: the offline learning phase refers to the phase when the prior knowledge 
is used to warm-up the agent for its future interactions with the real model. The online 
learning phase, on the other hand, refers to the actual interactions between the agent and 
the model. In many applications, interacting with the actual environment may be very 
costly (e.g. medical experiments). In such cases, the experiments made during the online 
learning phase are likely to be much more expensive than those performed during the offline 
learning phase. 


In this paper, we investigate how the way BRL algorithms use the offline learning phase
may impact online performances. To properly compare Bayesian algorithms, the first comprehensive
BRL benchmarking protocol is designed, following the foundations of Castronovo
et al. (2014). “Comprehensive BRL benchmark” refers to a tool which assesses the performance
of BRL algorithms over a large set of problems that are actually drawn according
to a prior distribution. In previous papers addressing BRL, authors usually validate their
algorithm by testing it on a few test problems, defined by a small set of predefined MDPs.
For instance, BAMCP (Guez et al. (2012)), SBOSS (Castro and Precup (2010)), and BFS3
(Asmuth and Littman (2011)) are all validated on a fixed number of MDPs. In their validation
process, the authors select a few BRL tasks, for which they choose one arbitrary
transition function, which defines the corresponding MDP. Then, they define one prior distribution
compliant with the transition function. This type of benchmarking is problematic
in the sense that the authors actually know the hidden transition function of each test case.
It also creates an implicit incentive to over-fit their approach to a few specific transition
functions, which should be completely unknown before interacting with the model. In this
paper, we compare BRL algorithms in several different tasks. In each task, the real transition
function is defined using a random distribution, instead of being arbitrarily fixed.
Each algorithm is thus tested, for each test case, on MDPs sampled from a potentially
infinite set rather than on a single fixed MDP.
To perform our experiments, we developed the BBRL library, whose objective is to also
provide other researchers with our benchmarking tool.

This paper is organised as follows: Section 2 presents the problem statement. Section 3
formally defines the experimental protocol designed for this paper. Section 4 briefly presents
the library. Section 5 shows a detailed application of our protocol, comparing several well-known
BRL algorithms on three different benchmarks. The final section concludes the study.


2. Problem Statement 

This section is dedicated to the formalisation of the different tools and concepts discussed 
in this paper. 
2.1 Reinforcement Learning

Let $M = (X, U, f(\cdot), \rho_M, p_{M,0}(\cdot), \gamma)$ be a given unknown MDP, where
$X = \{x^{(1)}, \ldots, x^{(n_X)}\}$ denotes its finite state space and
$U = \{u^{(1)}, \ldots, u^{(n_U)}\}$ refers to its finite action space. When
the MDP is in state $x_t$ at time $t$ and action $u_t$ is selected, the agent moves instantaneously
to a next state $x_{t+1}$ with a probability of $P(x_{t+1} \mid x_t, u_t) = f(x_t, u_t, x_{t+1})$. An instantaneous
deterministic, bounded reward $r_t = \rho_M(x_t, u_t, x_{t+1}) \in [R_{min}, R_{max}]$ is observed.

Let $h_t = (x_0, u_0, r_0, x_1, \cdots, x_{t-1}, u_{t-1}, r_{t-1}, x_t) \in H$ denote the history observed until
time $t$. An E/E strategy is a stochastic policy $\pi$ which, given the current history $h_t$, returns an
action $u_t \sim \pi(h_t)$. Given a probability distribution over initial states $p_{M,0}(\cdot)$, the expected
return of a given E/E strategy $\pi$ with respect to the MDP $M$ can be defined as follows:

$$J^{\pi}_{M} = \mathbb{E}_{x_0 \sim p_{M,0}(\cdot)} \left[ \mathcal{R}^{\pi}_{M}(x_0) \right],$$

where $\mathcal{R}^{\pi}_{M}(x_0)$ is the stochastic sum of discounted rewards received when applying the
policy $\pi$, starting from an initial state $x_0$:

$$\mathcal{R}^{\pi}_{M}(x_0) = \sum_{t=0}^{+\infty} \gamma^t r_t.$$

RL aims to learn the behaviour that maximises $J^{\pi}_{M}$, i.e. learning a policy $\pi^*$ defined as
follows:

$$\pi^* \in \arg\max_{\pi} J^{\pi}_{M}.$$
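As an illustration of the return defined above, the following minimal Python sketch samples one trajectory and accumulates discounted rewards. It is not part of BBRL: the nested-list encoding of the transition function `f`, the callables `rho` and `policy`, and the finite `horizon` (the definition above sums to infinity) are assumptions made for this example only.

```python
import random

def sample_next_state(f, x, u, rng):
    """Draw x' with probability f[x][u][x'] (f is a nested list of probabilities)."""
    return rng.choices(range(len(f[x][u])), weights=f[x][u], k=1)[0]

def discounted_return(f, rho, policy, x0, gamma, horizon, rng):
    """Sum of gamma^t * r_t over one sampled trajectory of `horizon` steps."""
    x, history, total = x0, [x0], 0.0
    for t in range(horizon):
        u = policy(history)                  # u_t ~ pi(h_t)
        x_next = sample_next_state(f, x, u, rng)
        r = rho(x, u, x_next)                # r_t = rho_M(x_t, u_t, x_{t+1})
        total += (gamma ** t) * r
        history += [u, r, x_next]            # extend h_t into h_{t+1}
        x = x_next
    return total

# Two states, one action, deterministic switch between states, reward 1 per step.
f = [[[0.0, 1.0]], [[1.0, 0.0]]]
ret = discounted_return(f, lambda x, u, y: 1.0, lambda h: 0,
                        x0=0, gamma=0.5, horizon=3, rng=random.Random(0))
print(ret)
```

Averaging such returns over initial states drawn from the initial state distribution gives a Monte Carlo estimate of the expected return.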


2.2 Prior Knowledge 


In this paper, the actual MDP is assumed to be initially unknown. Model-based Bayesian
Reinforcement Learning (BRL) proposes to model the uncertainty, using a probability
distribution $p^0_{\mathcal{M}}(\cdot)$ over a set of candidate MDPs $\mathcal{M}$. Such a probability distribution is
called a prior distribution and can be used to encode specific prior knowledge available
before interaction. Given a prior distribution $p^0_{\mathcal{M}}(\cdot)$, the expected return of a given E/E
strategy $\pi$ is defined as:

$$J^{\pi}_{p^0_{\mathcal{M}}(\cdot)} = \mathbb{E}_{M \sim p^0_{\mathcal{M}}(\cdot)} \left[ J^{\pi}_{M} \right].$$

In the BRL framework, the goal is to maximise $J^{\pi}_{p^0_{\mathcal{M}}(\cdot)}$ by finding $\pi^*$, which is called the
“Bayesian optimal policy” and defined as follows:

$$\pi^* \in \arg\max_{\pi} J^{\pi}_{p^0_{\mathcal{M}}(\cdot)}.$$


2.3 Computation time characterisation 

Most BRL algorithms rely on some properties which, given sufficient computation time, 
ensure that their agents will converge to an optimal behaviour. However, it is hard
to know beforehand whether an algorithm will satisfy fixed computation time constraints
while providing good performances.

The parameterisation of the algorithms makes the selection even more complex. Most
BRL algorithms depend on parameters (number of transitions simulated at each iteration, 
etc.) which, in some way, can affect the computation time. In addition, for one given 
algorithm and fixed parameters, the computation time often varies from one simulation to 
another. These features make it nearly impossible to compare BRL algorithms under strict 
computation time constraints. In this paper, to address this problem, algorithms are run 
with multiple choices of parameters, and we analyse their time performance a posteriori. 

Furthermore, a distinction between the offline and online computation time is made. 
Offline computation time corresponds to the moment when the agent is able to exploit 
its prior knowledge, but cannot interact with the MDP yet. One can see it as the time 
given to take the first decision. In most algorithms considered in this paper, this phase is
generally used to initialise some data structure. On the other hand, online computation 
time corresponds to the time consumed by an algorithm for taking each decision. 

There are many ways to characterise algorithms based on their computation time. One
can compare them based on the average time needed per step or on the offline computation
time alone. To remain flexible, for each run of each algorithm, we store its computation
times $(B_i)_{-1 \leq i}$, with $i$ indexing the time step, and $B_{-1}$ denoting the offline learning time. Then a
feature function $\phi((B_i)_{-1 \leq i})$ is extracted from this data. This function is used as a metric
to characterise and discriminate algorithms based on their time requirements.

In our protocol, which is detailed in the next section, two types of characterisation are
used. For a set of experiments, algorithms are classified based on their offline computation
time only, i.e. we use $\phi((B_i)_{-1 \leq i}) = B_{-1}$. Afterwards, the constraint is defined as
$\phi((B_i)_{-1 \leq i}) \leq K$, $K > 0$, in case it is required to only compare the algorithms that have an
offline computation time lower than $K$.

For another set of experiments, algorithms are separated according to their empirical
average online computation time. In this case, $\phi((B_i)_{-1 \leq i}) = \frac{1}{n} \sum_{0 \leq i < n} B_i$. Algorithms can
then be classified based on whether or not they respect the constraint $\phi((B_i)_{-1 \leq i}) \leq K$,
$K > 0$.

This formalisation could be used for any other computation time characterisation. For
example, one could want to analyse algorithms based on the longest computation time of a
trajectory, and define $\phi((B_i)_{-1 \leq i}) = \max_{-1 \leq i} B_i$.
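The three characterisations discussed in this section can be sketched in a few lines of Python. The list layout (offline time first, then one entry per online step) and the function names are assumptions of this example, not the storage format used by BBRL.

```python
def phi_offline(times):
    """Offline learning time: times[0] plays the role of B_{-1}."""
    return times[0]

def phi_online_mean(times):
    """Empirical average of the online times B_0, ..., B_{n-1} (times[1:])."""
    online = times[1:]
    return sum(online) / len(online)

def phi_max(times):
    """Longest computation time observed over the whole run."""
    return max(times)

def satisfies(phi, times, bound):
    """True if the chosen feature of the computation times is within the bound K."""
    return phi(times) <= bound

# One run: 2.0 s of offline learning, then three online decisions.
run = [2.0, 0.1, 0.3, 0.2]
print(phi_offline(run), phi_online_mean(run), phi_max(run))
print(satisfies(phi_online_mean, run, bound=0.25))
```

Because the feature function is applied a posteriori to stored times, a new characterisation only requires a new function, not a new run of the experiments.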


3. A new Bayesian Reinforcement Learning benchmark protocol 
3.1 A comparison criterion for BRL 


In this paper, a real Bayesian evaluation is proposed, in the sense that the different algorithms
are compared on a large set of problems drawn according to a test probability
distribution. This is in contrast with the Bayesian literature (Guez et al. (2012); Castro
and Precup (2010); Asmuth and Littman (2011)), where authors pick a fixed number of
MDPs on which they evaluate their algorithm.

Our criterion to compare algorithms is to measure their average rewards against a given
random distribution of MDPs, using another distribution of MDPs as a prior knowledge.
In our experimental protocol, an experiment is defined by a prior distribution $p^0_{\mathcal{M}}(\cdot)$ and
a test distribution $p_{\mathcal{M}}(\cdot)$. Both are random distributions over the set of possible MDPs,
not stochastic transition functions. To illustrate the difference, let us take an example. Let
$(x, u, x')$ be a transition. Given a transition function $f : X \times U \times X \to [0; 1]$, $f(x, u, x')$ is
the probability of observing $x'$ if we chose $u$ in $x$. In this paper, this function $f$ is assumed
to be the only unknown part of the MDP that the agent faces. Given a certain test case,
$f$ corresponds to a unique MDP $M \in \mathcal{M}$. A Bayesian learning problem is then defined by
a probability distribution over a set $\mathcal{M}$ of possible MDPs. We call it a test distribution,
and denote it $p_{\mathcal{M}}(\cdot)$. Prior knowledge can then be encoded as another distribution over $\mathcal{M}$,
denoted $p^0_{\mathcal{M}}(\cdot)$. We call “accurate” a prior which is identical to the test distribution
($p^0_{\mathcal{M}}(\cdot) = p_{\mathcal{M}}(\cdot)$), and we call “inaccurate” a prior which is different
($p^0_{\mathcal{M}}(\cdot) \neq p_{\mathcal{M}}(\cdot)$).

In previous Bayesian literature, authors select a fixed number of MDPs $M_1, \ldots, M_n$, and
train and test their algorithm on them. Doing so does not guarantee any generalisation
capabilities. To solve this problem, a protocol that allows rigorous comparison of BRL 
algorithms is designed. Training and test data are separated, and can even be generated 
from different distributions (in what we call the inaccurate case). 

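More precisely, our protocol can be described as follows: each algorithm is first trained
on the prior distribution. Then, their performances are evaluated by estimating the expectation
of the discounted sum of rewards, when they are facing MDPs drawn from the test
distribution. Let $J^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}}$ be this value:

$$J^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}} = \mathbb{E}_{M \sim p_{\mathcal{M}}} \left[ J^{\pi(p^0_{\mathcal{M}})}_{M} \right],$$

where $\pi(p^0_{\mathcal{M}})$ is the algorithm $\pi$ trained offline on $p^0_{\mathcal{M}}$. In our Bayesian RL setting, we want
to find the algorithm $\pi^*$ which maximises $J^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}}$ for the $\langle p^0_{\mathcal{M}}, p_{\mathcal{M}} \rangle$ experiment:

$$\pi^* \in \arg\max_{\pi} J^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}}.$$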

In addition to the performance criterion, we also measure the empirical computation 
time. In practice, all problems are subject to time constraints. Hence, it is important to 
take this parameter into account when comparing different algorithms. 


3.2 The experimental protocol 

In practice, we can only sample a finite number of trajectories, and must rely on estimators 
to compare algorithms. In this section our experimental protocol is described, which is based 
on our comparison criterion for BRL and provides a detailed computation time analysis. 

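An experiment is defined by (i) a prior distribution $p^0_{\mathcal{M}}$ and (ii) a test distribution $p_{\mathcal{M}}$.
Given these, an agent $\pi$ is evaluated as follows:

1. Train $\pi$ offline on $p^0_{\mathcal{M}}$.

2. Sample $N$ MDPs from the test distribution $p_{\mathcal{M}}$.

3. For each sampled MDP $M$, compute an estimate $\bar{J}^{\pi(p^0_{\mathcal{M}})}_{M}$ of $J^{\pi(p^0_{\mathcal{M}})}_{M}$.

4. Use these values to compute an estimate $\bar{J}^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}}$.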
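The four steps above can be sketched as follows. This is a hedged illustration rather than BBRL code: the `train` and `run_trajectory` callables are hypothetical stand-ins for an algorithm's offline phase and for one online trajectory, and each distribution is assumed to be given as Dirichlet concentration parameters `theta[x][u]`, one list per state-action pair.

```python
import random

def sample_transition_function(theta, rng):
    """Draw f from independent Dirichlet distributions, one per (x, u) pair.

    theta[x][u] is the list of concentration parameters for that pair.
    A Dirichlet sample is obtained by normalising independent Gamma draws.
    """
    f = []
    for row in theta:
        f_x = []
        for alphas in row:
            g = [rng.gammavariate(a, 1.0) for a in alphas]
            s = sum(g)
            f_x.append([v / s for v in g])
        f.append(f_x)
    return f

def evaluate(train, run_trajectory, prior_theta, test_theta, n_mdps, rng):
    agent = train(prior_theta)                           # step 1: offline training
    returns = []
    for _ in range(n_mdps):                              # step 2: sample N test MDPs
        f = sample_transition_function(test_theta, rng)
        returns.append(run_trajectory(agent, f, rng))    # step 3: one return per MDP
    return sum(returns) / len(returns), returns          # step 4: empirical average

# Dummy agent and dummy trajectory, only to show the call pattern.
theta = [[[1.0, 1.0]], [[1.0, 1.0]]]
mean, returns = evaluate(lambda prior: "agent", lambda agent, f, rng: 1.0,
                         theta, theta, n_mdps=5, rng=random.Random(1))
print(mean, len(returns))
```

Passing the same parameters as prior and test distribution corresponds to the accurate case; passing different ones corresponds to the inaccurate case.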





Castronovo and Ernst and Coetoux and Fonteneau 


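To estimate $J^{\pi(p^0_{\mathcal{M}})}_{M}$, the expected return of agent $\pi$ trained offline on $p^0_{\mathcal{M}}$, one trajectory
is sampled on the MDP $M$, and the cumulated return is computed:
$\bar{J}^{\pi(p^0_{\mathcal{M}})}_{M} = \mathcal{R}^{\pi(p^0_{\mathcal{M}})}_{M}(x_0)$.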




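To estimate this return, each trajectory is truncated after $T$ steps. Therefore, given an
MDP $M$ and its initial state $x_0$, we observe $\bar{\mathcal{R}}^{\pi(p^0_{\mathcal{M}})}_{M}(x_0)$, an approximation of $\mathcal{R}^{\pi(p^0_{\mathcal{M}})}_{M}(x_0)$:

$$\bar{\mathcal{R}}^{\pi(p^0_{\mathcal{M}})}_{M}(x_0) = \sum_{t=0}^{T} \gamma^t r_t.$$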


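If $R_{max}$ denotes the maximal instantaneous reward an agent can receive when interacting
with an MDP drawn from $p_{\mathcal{M}}$, then choosing $T$ as follows guarantees the approximation error is
bounded by $\epsilon > 0$:

$$T = \left\lfloor \frac{\log\left( \frac{\epsilon \times (1 - \gamma)}{R_{max}} \right)}{\log \gamma} \right\rfloor.$$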
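As a quick numerical check, the horizon can be computed directly. The sketch below assumes the bound reads $T = \lfloor \log(\epsilon (1 - \gamma) / R_{max}) / \log \gamma \rfloor$ and that rewards are bounded in absolute value by $R_{max}$; the function names are illustrative.

```python
import math

def truncation_horizon(epsilon, gamma, r_max):
    """Horizon T such that rewards received after step T weigh at most epsilon."""
    return math.floor(math.log(epsilon * (1.0 - gamma) / r_max) / math.log(gamma))

def tail_bound(horizon, gamma, r_max):
    """Upper bound on the discounted rewards dropped after step `horizon`:
    sum_{t > T} gamma^t * r_max = gamma^(T+1) * r_max / (1 - gamma)."""
    return gamma ** (horizon + 1) * r_max / (1.0 - gamma)

T = truncation_horizon(epsilon=0.01, gamma=0.95, r_max=1.0)
print(T, tail_bound(T, 0.95, 1.0))
```

With these illustrative values the horizon is 148 steps, and the dropped tail stays below the requested 0.01.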


$\epsilon = 0.01$ is set for all experiments, as a compromise between measurement accuracy and
computation time.

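Finally, to estimate our comparison criterion $J^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}}$, the empirical average of the algorithm
performance is computed over $N$ different MDPs, sampled from $p_{\mathcal{M}}$:

$$\bar{J}^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}} = \frac{1}{N} \sum_{0 \leq i < N} \bar{J}^{\pi(p^0_{\mathcal{M}})}_{M_i} = \frac{1}{N} \sum_{0 \leq i < N} \bar{\mathcal{R}}^{\pi(p^0_{\mathcal{M}})}_{M_i}(x_0). \qquad (1)$$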


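For each agent $\pi$, we retrieve $\mu_{\pi} = \bar{J}^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}}$ and $\sigma_{\pi}$, the empirical mean and standard deviation
of the results observed respectively. This gives us the following statistical confidence
interval at 95% for $J^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}}$:

$$J^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}} \in \left[ \mu_{\pi} - \frac{2 \sigma_{\pi}}{\sqrt{N}} ;\; \mu_{\pi} + \frac{2 \sigma_{\pi}}{\sqrt{N}} \right].$$

The values reported in the following figures and tables are estimations of the interval within
which $J^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}}$ lies, with probability 0.95.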
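A minimal sketch of this interval computation is given below. It assumes a normal approximation with a half-width of $2\sigma_{\pi}/\sqrt{N}$ (the factor 2 rounding the usual 1.96 quantile) and uses the sample standard deviation; both choices are assumptions of this example.

```python
import math
import statistics

def confidence_interval(returns, z=2.0):
    """Approximate 95% confidence interval for the mean of the observed returns."""
    n = len(returns)
    mu = statistics.mean(returns)
    sigma = statistics.stdev(returns)   # sample standard deviation (n - 1 divisor)
    half_width = z * sigma / math.sqrt(n)
    return mu - half_width, mu + half_width

low, high = confidence_interval([1.0, 2.0, 3.0, 4.0])
print(low, high)
```

Since the half-width shrinks as $1/\sqrt{N}$, quadrupling the number of sampled MDPs halves the width of the reported interval.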


As introduced in Section 2.3, in our methodology a function $\phi$ of computation times is
used to classify algorithms based on their time performance. The choice of $\phi$ depends on the
type of time constraints that are the most important to the user. In this paper, we reflect
this by showing three different ways to choose $\phi$. These three choices lead to three different
ways to look at the results and compare algorithms. The first one is to classify algorithms
based on their offline computation time; the second one is to classify them based on their
average online computation time. The third is a combination of the first two
choices of $\phi$, which we denote $\phi_{off}((B_i)_{-1 \leq i}) = B_{-1}$ and $\phi_{on}((B_i)_{-1 \leq i}) = \frac{1}{n} \sum_{0 \leq i < n} B_i$. The
objective is, for each pair of constraints $\phi_{off}((B_i)_{-1 \leq i}) \leq K_1$ and $\phi_{on}((B_i)_{-1 \leq i}) \leq K_2$,
$K_1, K_2 > 0$, to identify the best algorithms that respect these constraints. In order
to achieve this: (i) all agents that do not satisfy the constraints are discarded; (ii) for each
algorithm, the agent leading to the best performance on average is selected; (iii) we build
the list of agents whose performances are not significantly different (see footnote 1).

1. A paired-sample Z-test with a confidence level of 95% has been used to determine when two agents are
statistically equivalent (more details in Appendix E).
The results will help us to identify, for each experiment, the most suitable algorithm(s)
depending on the constraints the agents must satisfy. This protocol is an extension of the 
one presented in Castronovo et al. (2014). 
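The selection procedure (i) to (iii) can be sketched as follows. The dictionary layout of an agent record and the choice to compare every candidate against the top-performing agent are assumptions of this example; the exact test used in the paper is described in its appendix, which is not reproduced here.

```python
import math
import statistics

def paired_z_statistic(returns_a, returns_b):
    """Z statistic of the per-MDP differences between two agents' returns."""
    diffs = [a - b for a, b in zip(returns_a, returns_b)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / (sd / math.sqrt(len(diffs)))

def select_best(agents, k_off, k_on, z_crit=1.96):
    """agents: dicts with keys 'algorithm', 'offline', 'online', 'returns'."""
    # (i) discard agents violating the time constraints
    feasible = [a for a in agents if a["offline"] <= k_off and a["online"] <= k_on]
    # (ii) keep, for each algorithm, the agent with the best average return
    best = {}
    for a in feasible:
        cur = best.get(a["algorithm"])
        if cur is None or statistics.mean(a["returns"]) > statistics.mean(cur["returns"]):
            best[a["algorithm"]] = a
    if not best:
        return []
    # (iii) keep the agents not significantly different from the top one
    top = max(best.values(), key=lambda a: statistics.mean(a["returns"]))
    return [a for a in best.values()
            if abs(paired_z_statistic(top["returns"], a["returns"])) < z_crit]

agents = [
    {"algorithm": "A", "offline": 1.0, "online": 0.1, "returns": [10, 11, 12, 13]},
    {"algorithm": "A", "offline": 100.0, "online": 0.1, "returns": [20, 21, 22, 23]},
    {"algorithm": "B", "offline": 0.0, "online": 0.1, "returns": [1, 2, 1, 2]},
]
print([a["algorithm"] for a in select_best(agents, k_off=10.0, k_on=1.0)])
```

In this toy example the second agent of algorithm A is discarded for exceeding the offline bound, and algorithm B is dropped because its returns are significantly lower than those of the remaining A agent.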


4. BBRL library 

BBRL (see footnote 2) is a C++ open-source library for Bayesian Reinforcement Learning (discrete state/action
spaces). This library provides high-level features, while remaining as flexible and documented
as possible to address the needs of any researcher of this field. To this end, we
developed a complete command-line interface, along with a comprehensive website:


https://github.com/mcastron/BBRL 


BBRL focuses on the core operations required to apply the comparison benchmark 
presented in this paper. To do a complete experiment with the BBRL library, follow these 
five steps:


1. We create a test and a prior distribution. Those distributions are represented by Flat
Dirichlet Multinomial distributions (FDM), parameterised by a state space X, an action
space U, a vector of parameters θ, and a reward function ρ. For more information
about the FDM distributions, check Section 5.2.


    ./BBRL-DDS --mdp_distrib_generation \
        --name <name> \
        --short_name <short name> \
        --n_states <n_X> --n_actions <n_U> \
        --ini_state <x_0> \
        --transition_weights \
            <θ(1)> ... <θ(n_X n_U n_X)> \
        --reward_type "RT_CONSTANT" \
        --reward_means \
            <ρ(...)> ... \
        --output <output file>


A distribution file is created.


2. We create an experiment. An experiment is defined by a set of N MDPs, drawn from
a test distribution defined in a distribution file, a discount factor γ and a horizon limit
T.

    ./BBRL-DDS --new_experiment \
        --name <name> \
        --mdp_distribution "DirMultiDistribution" \
        --mdp_distribution_file <distribution file> \
        --n_mdps <N> --n_simulations_per_mdp 1 \
        --discount_factor <γ> --horizon_limit <T> \


2. BBRL stands for Benchmarking tools for Bayesian Reinforcement Learning.
An experiment file is created and can be used to conduct the same experiment for
several agents. 


3. We create an agent. An agent is defined by an algorithm alg, a set of parameters ψ,
and a prior distribution defined in a distribution file, on which the created agent will
be trained.



An agent file is created. The file also stores the computation time observed during 
the offline training phase. 


4. We run the experiment. We need to provide an experiment file, an algorithm alg and 
an agent file. 



A result file is created. This file contains a set of all transitions encountered during 
each trajectory. Additionally, the computation times we observed are also stored in 
this file. It is often impossible to measure precisely the computation time of a single 
decision. This is why only the computation time of each trajectory is reported in this 
file. 

5. Our results are exported. After each experiment has been performed, a set of K result 
files is obtained. We need to provide all agent files and result files to export the data. 

    ./BBRL-export --agent <alg #1> \
        --agent_file <agent file #1> \
        --experiment \
        --experiment_file <result file #1> \
        ...
        --agent <alg #K> \
        --agent_file <agent file #K> \
        --experiment \
        --experiment_file <result file #K>


BBRL will sort the data automatically and produce several files for each experiment. 

• A graph comparing offline computation cost w.r.t. performance; 

• A graph comparing online computation cost w.r.t. performance; 

• A graph where the X-axis represents the offline time bound, while the Y-axis
represents the online time bound. A point of the space corresponds to a set of
bounds. An algorithm is associated to a point of the space if its best agent,
satisfying the constraints, is among the best ones when compared to the others;

• A table reporting the results of each agent.

BBRL will also produce a LaTeX report file gathering the 3 graphs and the table
for each experiment.

More than 2,000 commands have to be entered in order to reproduce the results of
this paper. We decided to provide several Lua scripts in order to simplify the process. By
completing some configuration files, the user can define the agents, the possible values of
their parameters and the experiments to conduct.


    local agents =
    {
        -- e-Greedy
        {
            name = "EGreedyAgent",
            params =
            {
                {
                    opt = "--epsilon",
                    values =
                    {
                        0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
                    }
                }
            },
            olOptions = { "--compress_output" },
            memory = { ol = "1000M", re = "1000M" },
            duration = { ol = "01:00:00", re = "01:00:00" }
        },
    }


Figure 1: Example of a configuration file for the agents. 
    local experiments =
    {
        {
            prior = "GC", priorFile = "GC-distrib.dat",
            exp = "GC", testFile = "GC-distrib.dat",
            N = 500, gamma = 0.95, T = 250
        },
    }


Figure 2: Example of a configuration file for the experiments. 

Those configuration files are then used by a script called make_scripts.sh, included
within the library, whose purpose is to generate four other scripts:

• 0-init.sh 

Create the experiment files, and create the formulas sets required by OPPS agents. 

• 1-ol.sh 

Create the agents and train them on the prior distribution(s). 

• 2-re.sh 

Run all the experiments. 

• 3-export.sh 

Generate the LaTeX reports.

Due to the high computation power required, we made those scripts compatible with 
workload managers such as SLURM. In this case, each cluster should provide the same 
amount of CPU power in order to get consistent time measurements. To sum up, when the 
configuration files are completed correctly, one can start the whole process by executing the 
four scripts, and retrieve the results in nice reports. 

It is worth noting that there is no computation budget given to the agents. This is 
due to the diversity of the algorithms implemented. No algorithm is “anytime” natively, in 
the sense that we cannot stop the computation at any time and receive an answer from the 
agent instantly. Strictly speaking, it is possible to develop an anytime version of some of the 
algorithms considered in BBRL. However, we made the choice to stay as close as possible 
to the original algorithms proposed in their respective papers for reasons of fairness. In 
consequence, although computation time is a central parameter in our problem statement, 
it is never explicitly given to the agents. We instead let each agent run as long as necessary 
and analyse the time elapsed afterwards. 

Another point which needs to be discussed is the impact of the implementation of an algorithm
on the comparison results. For each algorithm, many implementations are possible,
some being better than others. Even though we did our best to provide the best possible
implementations, BBRL does not compare algorithms but rather the implementations of
each algorithm. Note that this issue mainly concerns small problems, since the complexity
of the algorithms is preserved.
5. Illustration


This section presents an illustration of the protocol presented in Section 3. We first describe
the algorithms considered for the comparison in Section 5.1, followed by a description of
the benchmarks in Section 5.2. Section 5.3 shows and analyses the results obtained.


5.1 Compared algorithms 

In this section, we present the list of the algorithms considered in this study. The pseudo-code
of each algorithm can be found in the appendix. For each algorithm, a list of “reasonable”
values is provided to test each of their parameters. When an algorithm has more than
one parameter, all possible parameter combinations are tested.


5.1.1 Random 

At each time-step $t$, the action $u_t$ is drawn uniformly from $U$.

5.1.2 ε-Greedy

The ε-Greedy agent maintains an approximation of the current MDP and computes, at each
time-step, its associated Q-function. The action is either selected randomly (with
a probability of ε, where 1 ≥ ε ≥ 0), or greedily (with a probability of 1 − ε) with respect to the
approximated model.


Tested values:

• ε ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}.
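The selection rule can be sketched as follows, assuming the Q-values of the current state under the approximated model are already available as a list indexed by action; solving the model to obtain them is left out, and the function name is illustrative.

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a uniform random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q-value (first one in case of ties).
    return max(range(len(q_values)), key=q_values.__getitem__)

rng = random.Random(0)
print(epsilon_greedy_action([0.1, 0.7, 0.3], epsilon=0.0, rng=rng))
```

With ε = 0 the agent is purely greedy, and with ε = 1 it behaves like the Random agent of Section 5.1.1.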


5.1.3 Soft-max 

The Soft-max agent maintains an approximation of the current MDP and computes, at
each time-step, its associated Q-function. The action is selected randomly, where
the probability to draw an action $u$ is proportional to $Q(x_t, u)$. The temperature parameter
$\tau$ allows to control the impact of the Q-function on these probabilities ($\tau \to 0^{+}$: greedy
selection; $\tau \to +\infty$: random selection).

Tested values:

• τ ∈ {0.05, 0.10, 0.20, 0.33, 0.50, 1.0, 2.0, 3.0, 5.0, 25.0}.
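The role of the temperature can be illustrated with the usual Boltzmann form, in which the probability of an action is proportional to exp(Q/τ). This exact form is an assumption of the sketch (the text only states that probabilities increase with the Q-value), chosen because it reproduces the two limits mentioned above.

```python
import math
import random

def softmax_probabilities(q_values, tau):
    """Boltzmann distribution over actions; subtracting the max avoids overflow."""
    m = max(q_values)
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_action(q_values, tau, rng):
    """Draw one action index according to the Boltzmann probabilities."""
    probs = softmax_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

print(softmax_probabilities([1.0, 2.0], tau=0.05))   # nearly greedy
print(softmax_probabilities([1.0, 2.0], tau=25.0))   # nearly uniform
```

Small temperatures concentrate almost all the probability mass on the best action, while large ones flatten the distribution towards uniform random selection.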


5.1.4 OPPS 


Given a prior distribution $p^0_{\mathcal{M}}(\cdot)$ and an E/E strategy space $S$ (either discrete or continuous),
the Offline, Prior-based Policy Search algorithm (OPPS) identifies a strategy $\pi^* \in S$ which
maximises the expected discounted sum of returns over MDPs drawn from the prior.

The OPPS for Discrete Strategy spaces algorithm (OPPS-DS) (Castronovo et al. (2012,
2014)) formalises the strategy selection problem as a $k$-armed bandit problem, where $k = |S|$.
Pulling an arm amounts to draw an MDP from $p^0_{\mathcal{M}}(\cdot)$ and play the E/E strategy
associated to this arm on it for one single trajectory. The discounted sum of returns observed
is the return of this arm. This multi-armed bandit problem has been solved by using the
UCB1 algorithm (Auer et al. (2002); Audibert et al. (2007)). The time budget is defined
by a variable $\beta$, corresponding to the total number of draws performed by the UCB1.

The E/E strategies considered by Castronovo et al. are index-based strategies, where
the index is generated by evaluating a small formula. A formula is a mathematical expression,
combining specific features (Q-functions of different models) by using standard
mathematical operators (addition, subtraction, logarithm, etc.). The discrete E/E strategy
space is the set of all formulas which can be built by combining at most $n$ features/operators
(such a set is denoted by $\mathbb{F}_n$).

OPPS-DS does not come with any guarantee. However, the UCB1 bandit algorithm
used to identify the best E/E strategy within the set of strategies provides statistical guarantees
that the best E/E strategies are identified with high probability after a certain budget
of experiments. However, it is not clear that the best strategy of the E/E strategy space
considered yields any high-performance strategy regardless of the problem.


Tested values: 

• S ∈ {F_2, F_3, F_4, F_5, F_6} (see footnote 3),

• β ∈ {50, 500, 1250, 2500, 5000, 10000, 100000, 1000000}.
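The bandit view of strategy selection can be sketched with a textbook UCB1 loop. The `pull` callable is a hypothetical stand-in for "draw an MDP from the prior and play the strategy of this arm for one trajectory"; rewards are assumed to be rescaled to [0, 1], as the standard UCB1 index expects, and the sketch is not the OPPS-DS implementation shipped with BBRL.

```python
import math

def ucb1_select(pull, n_arms, budget):
    """Run UCB1 for `budget` draws and return the arm with the best empirical mean."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(budget):
        if t < n_arms:
            arm = t   # initialisation: play each arm once
        else:
            arm = max(range(n_arms),
                      key=lambda k: means[k] + math.sqrt(2.0 * math.log(t) / counts[k]))
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]   # incremental mean
    # Arms never played (budget smaller than n_arms) are ignored.
    return max(range(n_arms), key=lambda k: means[k] if counts[k] > 0 else -math.inf)

# Toy deterministic arms standing in for three E/E strategies.
rewards = [0.1, 0.9, 0.5]
print(ucb1_select(lambda k: rewards[k], n_arms=3, budget=200))
```

Note that some tested budgets are smaller than the size of the larger formula sets, in which case not every strategy can be tried even once; the sketch handles this by only ranking arms that were actually played.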


5.1.5 BAMCP 


Bayes-adaptive Monte Carlo Planning (BAMCP) (Guez et al. (2012)) is an evolution of
the Upper Confidence Tree (UCT) algorithm (Kocsis and Szepesvári (2006)), where each
transition is sampled according to the history of observed transitions. The principle of this
algorithm is to adapt the UCT principle for planning in a Bayes-adaptive MDP, also called
the belief-augmented MDP, which is an MDP obtained when considering augmented states
made of the concatenation of the actual state and the posterior. The BAMCP algorithm is
made computationally tractable by using a sparse sampling strategy, which avoids sampling
a model from the posterior distribution at every node of the planning tree. Note that
BAMCP also comes with theoretical guarantees of convergence towards Bayesian optimality.

In practice, the BAMCP relies on two parameters: (i) Parameter K which defines the 
number of nodes created at each time-step, and (ii) Parameter depth which defines the 
depth of the tree from the root. 


Tested values: 

• K ∈ {1, 500, 1250, 2500, 5000, 10000, 25000},

• depth ∈ {15, 25, 50}.


3. The number of arms k is always equal to the number of strategies in the given set. For your information:
|F_2| = 12, |F_3| = 43, |F_4| = 226, |F_5| = 1210, |F_6| = 7407.
5.1.6 BFS3


The Bayesian Forward Search Sparse Sampling (BFS3) (Asmuth and Littman (2011)) is
a Bayesian RL algorithm whose principle is to apply the principle of the FSSS (Forward
Search Sparse Sampling, see Kearns et al. (2002)) algorithm to belief-augmented MDPs.
It first samples one model from the posterior, which is then used to sample transitions.
The algorithm then relies on lower and upper bounds on the value of each augmented state
to prune the search space. The authors also show that BFS3 converges towards Bayes-optimality
as the number of samples increases.

In practice, the parameters of BFS3 are used to control how much computational power 
is allowed. The parameter K defines the number of nodes to develop at each time-step, C 
defines the branching factor of the tree and depth controls its maximal depth. 


Tested values: 

• K ∈ {1, 500, 1250, 2500, 5000, 10000},

• C ∈ {2, 5, 10, 15},

• depth ∈ {15, 25, 50}.


5.1.7 SBOSS 


The Smarter Best of Sampled Set (SBOSS) (Castro and Precup (2010)) is a Bayesian RL
algorithm which relies on the assumption that the model is sampled from a Dirichlet distribution.
From this assumption, it derives uncertainty bounds on the value of state-action
pairs. It then uses those bounds to decide how many models to sample from the posterior,
and how often the posterior should be updated in order to reduce the computational cost
of Bayesian updates. The sampling technique is then used to build a merged MDP, as
in Asmuth et al. (2009), and to derive the corresponding optimal action with respect to
that MDP. In practice, the number of sampled models is determined dynamically with a
parameter ε. The re-sampling frequency depends on a parameter δ.


Tested values:


• ε ∈ {1.0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6},


• δ ∈ {9, 7, 5, 3, 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6}.


5.1.8 BEB 


The Bayesian Exploration Bonus (BEB) (Kolter and Ng (2009)) is a Bayesian RL algorithm
which builds, at each time-step $t$, the expected MDP given the current posterior. Before
solving this MDP, it computes a new reward function
$\rho_{BEB}(x, u, y) = \rho_M(x, u, y) + \frac{\beta}{c^{(t)}_{\langle x,u,y \rangle}}$,
where $c^{(t)}_{\langle x,u,y \rangle}$ denotes the number of times transition $\langle x, u, y \rangle$ has been observed at
time-step $t$. This algorithm solves the mean MDP of the current posterior, in which we replaced
$\rho_M(\cdot, \cdot, \cdot)$ by $\rho_{BEB}(\cdot, \cdot, \cdot)$, and applies its optimal policy on the current MDP for one
step. The bonus $\beta$ is a parameter controlling the E/E balance. BEB comes with theoretical
guarantees of convergence towards Bayesian optimality.

Tested values:


• β ∈ {0.25, 0.5, 1, 1.5, 2, 2.5, 3, 4, 8, 16}.

5.1.9 Computation times variance 

Each algorithm has one or more parameters that can affect the number of sampled transitions
from a given state, or the length of each simulation. This, in turn, impacts the
computation time requirement at each step. Hence, for some algorithms, no choice of parameters
can bring the computation time below or over certain values. In other words, each
algorithm has its own range of computation time. Note that, for some methods, the computation
time is influenced concurrently by several parameters. We present a qualitative
description of how computation time varies as a function of parameters in Table 1.



| Algorithm | Offline phase duration | Online phase duration |
|-----------|------------------------|-----------------------|
| Random | Almost instantaneous. | Almost instantaneous. |
| ε-Greedy (footnote 4) | Almost instantaneous. | Varies in inverse proportion to ε. Can vary a lot from one step to another. |
| OPPS-DS | Varies proportionally to β. | Varies proportionally to the number of features involved in the selected E/E strategy. |
| BAMCP (footnote 5) | Almost instantaneous. | Varies proportionally to K and depth. |
| BFS3 (footnote 6) | Almost instantaneous. | Varies proportionally to K, C and depth. |
| SBOSS (footnote 7) | Almost instantaneous. | Varies in inverse proportion to ε and δ. Can vary a lot from one step to another, with a general decreasing tendency. |
| BEB | Almost instantaneous. | Constant. |

Table 1: Influence of the algorithms and their parameters on the duration of the offline and online
phases.


4. If a random decision is chosen, the model is not solved. 

5. K defines the number of nodes to develop at each step, and depth defines the maximal depth of the tree. 

6. K defines the number of nodes to develop at each step, C the branching factor of the tree and depth its 
maximal depth. 

7. The number of models sampled is inversely proportional to ε, while the frequency at which the models
are sampled is inversely proportional to δ. When an MDP has been sufficiently explored, the number of
models to sample and the frequency of the sampling will decrease.
5.2 Benchmarks

In our setting, the transition matrix is the only element which differs between two MDPs
drawn from the same distribution. For each < state, action > pair < x,u >, we define a
Dirichlet distribution, which represents the uncertainty about the transitions occurring from
< x,u >. A Dirichlet distribution is parameterised by a set of concentration parameters
$\alpha^{(1)}_{<x,u>}, \ldots, \alpha^{(n_X)}_{<x,u>}$.

We gathered all concentration parameters in a single vector θ. Consequently, our MDP
distributions are parameterised by $\rho_M$ (the reward function) and several Dirichlet
distributions, parameterised by θ. Such a distribution is denoted by $p_{\rho_M, \theta}(\cdot)$. In the
Bayesian Reinforcement Learning community, these distributions are referred to as Flat
Dirichlet Multinomial distributions (FDMs).

We chose to study two different cases: 

• Accurate case: the test distribution is fully known ($p^0_M(\cdot) = p_M(\cdot)$);

• Inaccurate case: the test distribution is unknown ($p^0_M(\cdot) \neq p_M(\cdot)$).


In the inaccurate case, we make no assumption on the transition matrix. We represented
this lack of knowledge by a uniform FDM distribution, where each transition has been
observed one single time (θ = [1, ..., 1]).
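Drawing a transition matrix from an FDM can be sketched as follows. This is only an illustration under stated assumptions: each Dirichlet vector is sampled by normalising independent Gamma draws, and a concentration parameter equal to zero is treated as a transition of probability zero. The function names and the dictionary layout are hypothetical and do not reflect the BBRL implementation.

```python
import random


def sample_dirichlet(alpha, rng):
    """Sample one probability vector from a Dirichlet with concentrations alpha.

    Assumption: a zero concentration means the transition never occurs.
    """
    draws = [rng.gammavariate(a, 1.0) if a > 0 else 0.0 for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def uniform_fdm(n_states, n_actions):
    """Uniform prior: every transition has been observed one single time."""
    return {(x, u): [1.0] * n_states
            for x in range(n_states) for u in range(n_actions)}


def sample_transition_matrix(theta, rng):
    """Draw one transition vector per <state, action> pair."""
    return {pair: sample_dirichlet(alpha, rng) for pair, alpha in theta.items()}


rng = random.Random(0)
P = sample_transition_matrix(uniform_fdm(5, 3), rng)
print(P[(0, 0)])  # one probability vector over the 5 next states
```

Repeating the last call with fresh random draws yields a different MDP each time, which is how a large set of test problems can be generated from a single distribution.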


Sections 5.2.1, 5.2.2 and 5.2.3 describe the three distributions considered for this
study.


5.2.1 Generalised Chain distribution ($p_{\rho^{GC}, \theta^{GC}}(\cdot)$)

The Generalised Chain (GC) distribution is inspired by the five-state chain problem (5
states, 3 actions) (Dearden et al. (1998)). The agent starts at State 1, and has to go through
States 2, 3 and 4 in order to reach the last state (State 5), where the best rewards are. The
agent has at its disposal 3 actions. An action can either let the agent move from State n
to State n + 1 or force it to go back to State 1. The transition matrix is drawn from an
FDM parameterised by $\theta^{GC}$, and the reward function is denoted by $\rho^{GC}$. More details can
be found in Appendix B.1.



Figure 3: Illustration of the GC distribution. 


5.2.2 Generalised Double-Loop distribution ($p_{\rho^{GDL}, \theta^{GDL}}(\cdot)$)

The Generalised Double-Loop (GDL) distribution is inspired by the double-loop problem
(9 states, 2 actions) (Dearden et al. (1998)). Two loops of 5 states cross at State 1,
where the agent starts. One loop is a trap: if the agent enters it, its only way out is to
cross all the states composing it. Exiting this loop provides a small reward. The
other loop yields a good reward. However, each action of this loop can either let the
agent move to the next state of the loop or force it to return to State 1 with no reward. The
transition matrix is drawn from an FDM parameterised by $\theta^{GDL}$, and the reward function
is denoted by $\rho^{GDL}$. More details can be found in Appendix B.2.



Figure 4: Illustration of the GDL distribution. 


5.2.3 Grid distribution 

The Grid distribution is inspired by Dearden's maze problem (25 states, 4 actions)
(Dearden et al. (1998)). The agent is placed at a corner of a 5x5 grid (the S cell), and
has to reach the opposite corner (the G cell). When it succeeds, it returns to its initial
state and receives a reward. The agent can perform 4 different actions, corresponding to
the 4 directions (up, down, left, right). However, depending on the cell on which the agent
is, each action has a certain probability to fail, and can prevent the agent from moving in the
selected direction. The transition matrix is drawn from an FDM parameterised by $\theta^{Grid}$,
and the reward function is denoted by $\rho^{Grid}$. More details can be found in Appendix B.3.



Figure 5: Illustration of the Grid distribution. 
5.3 Discussion of the results

5.3.1 Accurate case

[Plots not reproduced: one panel per experiment (GC, GDL, Grid), showing the agents Random,
ε-Greedy, Soft-max, OPPS-DS, BAMCP, BFS3, SBOSS and BEB against the offline computation
cost (in m) for Figure 6 and the online computation cost (in ms) for Figure 7, on
logarithmic axes.]

Figure 6: Offline computation cost vs. performance

Figure 7: Online computation cost vs. performance
[Plots of Figure 8 not reproduced: GC experiment panel; legend: Random, ε-Greedy, Soft-max,
OPPS-DS, BAMCP, BFS3, SBOSS, BEB.]
GC Experiment

Agent                                | Score
Random                               | 31.12 ± 0.9
ε-Greedy (ε = 0)                     | 40.62 ± 1.55
Soft-Max (τ = 0.1)                   | 34.73 ± 1.74
OPPS-DS (Q2(x,u)/Q0(x,u))            | 42.47 ± 1.91
BAMCP (K = 2500, depth = 15)         | 35.56 ± 1.27
BFS3 (K = 500, C = 15, depth = 15)   | 39.84 ± 1.74
SBOSS (ε = 0.001, δ = 7)             | 35.9 ± 1.89
BEB (β = 2.5)                        | 41.72 ± 1.63


GDL Experiment

Agent                                | Score
Random                               | 2.79 ± 0.07
ε-Greedy (ε = 0.1)                   | 3.05 ± 0.07
Soft-Max (τ = 0.1)                   | 2.79 ± 0.1
OPPS-DS (max(Q0(x,u), |Q2(x,u)|))    | 3.1 ± 0.07
BAMCP (K = 10000, depth = 15)        | 3.11 ± 0.07
BFS3 (K = 1, C = 15, depth = 25)     | 2.9 ± 0.07
SBOSS (ε = 1, δ = 1)                 | 2.81 ± 0.1
BEB (β = 0.5)                        | 3.09 ± 0.07


Grid Experiment

Agent                                | Score
Random                               | 0.22 ± 0.06
ε-Greedy (ε = 0)                     | 6.9 ± 0.31
Soft-Max (τ = 0.05)                  | 0 ± 0
OPPS-DS (Q0(x,u) ± Q2(x,u))          | 7.03 ± 0.3
BAMCP (K = 25000, depth = 15)        | 6.43 ± 0.3
BFS3 (K = 500, C = 15, depth = 50)   | 3.46 ± 0.23
SBOSS (ε = 0.1, δ = 7)               | 4.5 ± 0.33
BEB (β = 0.5)                        | 6.76 ± 0.3

Figure 8: Best algorithms w.r.t. offline/online time periods


Figure 9: Best algorithms w.r.t. performance
As can be seen in Figure 6, OPPS is the only algorithm whose offline time cost varies.
In the three different settings, OPPS can be launched after a few seconds, but behaves very
poorly. However, its performance increased very quickly when given at least one minute of
computation time. Algorithms that do not use offline computation time have a wide range
of different scores. This variance represents the different possible configurations for these
algorithms, which only lead to different online computation times.

In Figure 7, BAMCP, BFS3 and SBOSS have variable online time costs. BAMCP
behaved poorly on the first experiment, but obtained the best score on the second one and
was pretty efficient on the last one. BFS3 was good only on the second experiment. SBOSS
was never able to get a good score in any case. Note that the online time cost of OPPS
varies slightly depending on the formula's complexity.

If we take a look at the top-right point in Figure 8, which defines the least restrictive
bounds, we notice that OPPS-DS and BEB were always the best algorithms in every
experiment. ε-Greedy was a good candidate in the first two experiments. BAMCP was also a
very good choice except for the first experiment. On the contrary, BFS3 and SBOSS were
only good choices in the first experiment.

If we look closely, we can notice that OPPS-DS was always one of the best algorithms
once its minimal offline computation time requirements were met.

Moreover, when we place our offline-time bound right under the minimal offline time cost
of OPPS-DS, we can see how the top is affected from left to right:

GC: (Random), (SBOSS), (BEB, ε-Greedy), (BEB, BFS3, ε-Greedy),

GDL: (Random), (Random, SBOSS), (ε-Greedy), (BEB, ε-Greedy),
(BAMCP, BEB, ε-Greedy),

Grid: (Random), (SBOSS), (ε-Greedy), (BEB, ε-Greedy).

We can clearly see that SBOSS was the first algorithm to appear on the top, with a very
small online computation cost, followed by ε-Greedy and BEB. Beyond a certain online time
bound, BFS3 emerged in the first experiment while BAMCP emerged in the second
experiment. Neither of them was able to compete with BEB or ε-Greedy in the last experiment.

Soft-max was never able to reach the top regardless of the configuration.

Figure 9 reports the best score observed for each algorithm, disassociated from any
time measure. Note that the variance is very similar for all algorithms in the GDL and Grid
experiments. On the contrary, in the GC experiment, the variance oscillates between 1.0
and 2.0. However, OPPS seems to be the least stable algorithm in the three cases.
5.3.2 Inaccurate case

Figure 10: Offline computation cost vs. performance


Figure 11: Online computation cost vs. performance
GC Experiment

Agent                                | Score
Random                               | 31.67 ± 1.05
ε-Greedy (ε = 0)                     | 37.69 ± 1.75
Soft-Max (τ = 0.33)                  | 34.75 ± 1.64
OPPS-DS (Q0(x,u))                    | 39.29 ± 1.71
BAMCP (K = 1250, depth = 25)         | 33.87 ± 1.26
BFS3 (K = 1250, C = 15, depth = 25)  | 36.87 ± 1.82
SBOSS (ε = 1e-06, δ = 0.0001)        | 38.77 ± 1.89
BEB (β = 16)                         | 38.34 ± 1.62


GDL Experiment

Agent                                | Score
Random                               | 2.76 ± 0.08
ε-Greedy (ε = 0.3)                   | 2.88 ± 0.07
Soft-Max (τ = 0.05)                  | 2.76 ± 0.1
OPPS-DS (max(Q0(x,u), Q1(x,u)))      | 2.99 ± 0.08
BAMCP (K = 10000, depth = 50)        | 2.85 ± 0.07
BFS3 (K = 1250, C = 15, depth = 50)  | 2.85 ± 0.07
SBOSS (ε = 0.1, δ = 0.001)           | 2.86 ± 0.07
BEB (β = 2.5)                        | 2.88 ± 0.07


Grid Experiment

Agent                                | Score
Random                               | 0.23 ± 0.06
ε-Greedy (ε = 0.2)                   | 0.63 ± 0.09
Soft-Max (τ = 0.05)                  | 0 ± 0
OPPS-DS (Q1(x,u) ± Q2(x,u))          | 1.09 ± 0.17
BAMCP (K = 25000, depth = 25)        | 0.51 ± 0.09
BFS3 (K = 1, C = 15, depth = 50)     | 0.42 ± 0.09
SBOSS (ε = 0.001, δ = 0.1)           | 0.29 ± 0.07
BEB (β = 0.25)                       | 0.29 ± 0.05


Figure 12: Best algorithms w.r.t. offline/online time periods


Figure 13: Best algorithms w.r.t. performance
As in the accurate case, Figure 10 shows impressive performance for OPPS-DS, which has
beaten all other algorithms in every experiment. We can also notice that, as observed in
the accurate case, the scores of the OPPS-DS agents are very close in the Grid experiment.
However, only a few were able to significantly surpass the others, contrary to
the accurate case where most OPPS-DS agents were very good candidates.

Surprisingly, SBOSS was a very good alternative to BAMCP and BFS3 in the first two
experiments, as shown in Figure 11. It was able to surpass both algorithms on the first one
while being very close to the performance of BAMCP in the second. Relative performances of
BAMCP and BFS3 remained the same in the inaccurate case, even if the BAMCP advantage
is less visible in the second experiment. BEB was no longer able to compete with OPPS-DS
and was even beaten by BAMCP and BFS3 in the last experiment. ε-Greedy was still a
decent choice except in the first experiment. As observed in the accurate case, Soft-max
was very bad in every case.

In Figure 12, if we take a look at the top-right point, we can see that OPPS-DS is the best
choice in the second and third experiments. BEB, SBOSS and ε-Greedy share the first place
with OPPS-DS in the first one.

If we place our offline-time bound right under the minimal offline time cost of OPPS-DS,
we can see how the top is affected from left to right:


GC: (Random), (Random, SBOSS), (SBOSS), (BEB, SBOSS, ε-Greedy),
(BEB, BFS3, SBOSS, ε-Greedy),

GDL: (Random), (Random, SBOSS), (BAMCP, Random, SBOSS),
(BEB, SBOSS, ε-Greedy), (BEB, BFS3, SBOSS, ε-Greedy),
(BAMCP, BEB, BFS3, SBOSS, ε-Greedy),


Grid: (Random), (Random, SBOSS), (BAMCP, BEB, BFS3, Random, SBOSS),
(ε-Greedy).


SBOSS is again the first algorithm to appear in the rankings. ε-Greedy is the only
one which could reach the top in every case, even when facing BAMCP and BFS3 fed
with high online computation cost. BEB no longer appears to be undeniably better than
the others. Besides, the first two experiments show that most algorithms obtained similar
results, except for BAMCP which does not appear on the top in the first experiment. In
the last experiment, ε-Greedy succeeded in beating all other algorithms.


Figure 13 does not bring us more information than what we observed in the accurate
case.


5.3.3 Summary 

In the accurate case, OPPS-DS was always among the best algorithms, at the cost of some 
offline computation time. When the offline time budget was too constrained for OPPS-DS, 
different algorithms were suitable depending on the online time budget: 

• Low online time budget: SBOSS was the fastest algorithm to make better decisions
than a random policy.

• Medium online time budget [8]: BEB reached performances similar to OPPS-DS
on each experiment.

• High online time budget [9]: In the first experiment, BFS3 managed to catch up with
BEB and OPPS-DS when given sufficient time. In the second experiment, BAMCP
achieved this result. Neither BFS3 nor BAMCP was able to compete with
BEB and OPPS-DS in the last experiment.

The inaccurate case gave a different picture. BEB was not as good
as it seemed to be in the accurate case, while SBOSS improved significantly compared to
the others. For its part, OPPS-DS obtained the best overall results in the inaccurate case
by outperforming all the other algorithms in two out of three experiments while remaining
among the best ones in the last experiment.

6. Conclusion 

We have proposed a new extensive BRL comparison methodology which takes into account
both performance and time requirements for each algorithm. In particular, our benchmarking
protocol shows that no single algorithm dominates all other algorithms on all scenarios.
The protocol we introduced can compare anytime algorithms to non-anytime algorithms
while measuring the impact of inaccurate offline training. By comparing algorithms on
large sets of problems, we avoid overfitting to a single problem. Our methodology is
associated with an open-source library, BBRL, and we hope that it will help other researchers
to design algorithms whose performance is put into perspective with computation times,
which may be critical in many applications. This library is specifically designed to handle
new algorithms easily, and is provided with a complete and comprehensive documentation
website.
8. ± 100 times more than the low online time budget

9. ± 100 times more than the medium online time budget





Castronovo and Ernst and Coetoux and Fonteneau 


Appendix A. Pseudo-code of the algorithms 


Algorithm 1 ε-Greedy

procedure OFFLINE-LEARNING($p^0_M(\cdot)$)
    M ← "Build an initial model based on $p^0_M(\cdot)$"
end procedure

function SEARCH(x, h)
    {Draw a random value in [0; 1]}
    r ~ U(0, 1)

    if r < ε then {Random case}
        return "An action selected randomly"
    else {Greedy case}
        π*_M ← VALUE-ITERATION(M)
        return π*_M(x)
    end if
end function

procedure ONLINE-LEARNING(x, u, y, r)
    "Update model M w.r.t. transition < x,u,y,r >"
end procedure
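The SEARCH function of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch, not the BBRL implementation: the model is assumed to be given as nested lists `P[x][u][y]` (transition probabilities) and `R[x][u][y]` (rewards), and the discount factor `gamma`, the tolerance `tol` and all function names are assumptions introduced for the example.

```python
import random


def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Return a greedy policy (one action index per state) for the model (P, R)."""
    n = len(P)
    V = [0.0] * n
    while True:
        # Q[x][u] is the expected return of playing u in x, then following V.
        Q = [[sum(P[x][u][y] * (R[x][u][y] + gamma * V[y]) for y in range(n))
              for u in range(len(P[x]))] for x in range(n)]
        new_V = [max(q) for q in Q]
        if max(abs(a - b) for a, b in zip(V, new_V)) < tol:
            return [q.index(max(q)) for q in Q]
        V = new_V


def epsilon_greedy_search(x, P, R, epsilon, rng, gamma=0.95):
    """Pick a random action with probability epsilon, the greedy one otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(P[x]))   # random case: the model is not solved
    return value_iteration(P, R, gamma)[x]  # greedy case


# Two states, two actions: action u always leads to state u; reaching state 1 pays 1.
P = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
R = [[[0.0, 1.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
print(epsilon_greedy_search(0, P, R, epsilon=0.0, rng=random.Random(0)))
```

Because the model is only solved in the greedy case, the online cost of a step depends on the random draw, which matches the step-to-step variability noted for ε-Greedy in Table 1.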
Algorithm 2 Soft-max

procedure OFFLINE-LEARNING($p^0_M(\cdot)$)
    M ← "Build an initial model based on $p^0_M(\cdot)$"
end procedure

function SEARCH(x, h)
    {Draw a random value in [0; 1]}
    r ~ U(0, 1)

    {Select an action randomly, with a probability proportional to exp(Q*_M(x, u)/τ)}
    Q*_M ← "Compute the optimal Q-function of M"
    for 1 ≤ i ≤ |U| do
        if r < Σ_{j ≤ i} exp(Q*_M(x, u^(j))/τ) / Σ_{u'} exp(Q*_M(x, u')/τ) then
            return u^(i)
        end if
    end for
end function

procedure ONLINE-LEARNING(x, u, y, r)
    "Update model M w.r.t. transition < x, u, y, r >"
end procedure
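The action selection step of a Soft-max agent can be sketched as follows. The sketch assumes the usual Boltzmann form, where each action is drawn with probability proportional to exp(Q/τ), and samples it by walking through the cumulative distribution with a single uniform draw. The function name and the subtraction of the maximal Q-value (for numerical stability) are choices made for this example only.

```python
import math
import random


def softmax_search(q_values, tau, rng):
    """Sample an action index with probability proportional to exp(Q / tau)."""
    q_max = max(q_values)
    # Subtracting q_max leaves the probabilities unchanged and avoids overflow.
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)

    r = rng.random()          # random value in [0; 1)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w / total
        if r < cumulative:
            return i
    return len(q_values) - 1  # guard against floating-point rounding


rng = random.Random(0)
print([softmax_search([1.0, 1.2, 0.8], tau=0.1, rng=rng) for _ in range(10)])
```

A small τ concentrates the probability mass on the action with the highest Q-value, while a large τ makes the choice close to uniform.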
Algorithm 3 OPPS-DS

procedure OFFLINE-LEARNING($p^0_M(\cdot)$)
    {Initialise the k arms of UCB1}
    for 1 ≤ i ≤ k do
        R_i ← "Simulate strategy π_i on MDP M over a single trajectory"
        μ(i) ← R_i
        θ(i) ← 1
    end for

    {Run UCB1 with a budget of β}
    for k + 1 ≤ b ≤ β do
        a ← argmax_{a'} μ(a') + sqrt(2 ln(b) / θ(a'))
        R_a ← "Simulate strategy π_a on MDP M over a single trajectory"
        μ(a) ← (θ(a) μ(a) + R_a) / (θ(a) + 1)
        θ(a) ← θ(a) + 1
    end for

    {Select the E/E strategy associated to the most drawn arm}
    a* ← argmax_{a'} θ(a')
    π_OPPS ← π_{a*}
end procedure

function SEARCH(x, h)
    return u ~ π_OPPS(x, h)
end function

procedure ONLINE-LEARNING(x, u, y, r)
    "Update strategy π_OPPS w.r.t. transition < x,u,y,r >"
end procedure
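The offline phase of OPPS-DS relies on a bandit algorithm to identify the best E/E strategy. A minimal sketch of that selection loop is given below, assuming the standard UCB1 index (empirical mean plus sqrt(2 ln b / number of draws)). The callback `simulate(arm, rng)`, which is expected to return the score of one simulated trajectory, and the function name `ucb1_select` are hypothetical.

```python
import math
import random


def ucb1_select(simulate, k, budget, rng):
    """Run UCB1 over k arms for `budget` draws and return the most drawn arm."""
    mu = [0.0] * k     # empirical mean return of each arm
    draws = [0] * k    # number of times each arm has been drawn

    # Initialisation: draw every arm once.
    for i in range(k):
        mu[i] = simulate(i, rng)
        draws[i] = 1

    # Main loop: draw the arm with the highest upper confidence bound.
    for b in range(k + 1, budget + 1):
        a = max(range(k),
                key=lambda j: mu[j] + math.sqrt(2.0 * math.log(b) / draws[j]))
        ret = simulate(a, rng)
        mu[a] = (draws[a] * mu[a] + ret) / (draws[a] + 1)
        draws[a] += 1

    return max(range(k), key=lambda j: draws[j])


# Toy example: arm 1 has the highest expected return.
means = [0.2, 0.8, 0.4]
best = ucb1_select(lambda arm, rng: means[arm] + rng.uniform(-0.1, 0.1),
                   k=3, budget=500, rng=random.Random(0))
print(best)
```

The budget plays the role of the parameter β of OPPS-DS: the offline phase duration grows proportionally with it, as reported in Table 1.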







Benchmarking for Bayesian Reinforcement Learning 


Algorithm 4 BAMCP (1/2)

function SEARCH(x, h)
    {Develop a MCTS and compute Q(·, ·)}
    for 1 ≤ k ≤ K do
        M ← "Sample an MDP from the current posterior"
        SIMULATE((x, h), M, 0)
    end for

    {Return the best action w.r.t. Q(·, ·)}
    return argmax_u Q((x, h), u)
end function

function SIMULATE((x, h), M, d)
    if N((x, h)) = 0 then {New node reached}
        "Initialise N((x, h), u), Q((x, h), u)"
        u ~ π_0((x, h))
        "Sample x', r from model M"

        {Estimate the score of this node by using the rollout policy}
        R ← r + γ ROLLOUT((x', hux'), M, d)
        "Update N((x, h)), N((x, h), u), Q((x, h), u)"
        return R
    end if

    {Select the next branch to explore}
    u ← argmax_{u'} Q((x, h), u') + c sqrt(log N((x, h)) / N((x, h), u'))
    "Sample x', r from model M"

    {Follow the branch and evaluate it}
    R ← r + γ SIMULATE((x', hux'), M, d + 1)
    "Update N((x, h)), N((x, h), u), Q((x, h), u)"
    return R
end function
Algorithm 5 BAMCP (2/2)

procedure ROLLOUT((x, h), M, d)
    if γ^d R_max < ε then {Truncate the trajectory if precision ε has been reached}
        return 0
    end if

    {Use the rollout policy to choose the action to perform}
    u ~ π_0(x, h)

    {Simulate a single transition from M and continue the rollout process}
    y ~ P_M(x, u)
    r ← ρ_M(x, u, y)
    return r + γ ROLLOUT((y, huy), M, d + 1)
end procedure

procedure ONLINE-LEARNING(x, u, y, r)
    "Update the posterior w.r.t. transition < x, u, y, r >"
end procedure
Algorithm 6 BFS3

function SEARCH(x, h)
    {Update the current Q-function}
    M_mean ← "Compute the mean MDP of the current posterior"
    for all u ∈ U do
        for 1 ≤ i ≤ C do
            {Draw y and r from the mean MDP of the posterior}
            y ~ P_{M_mean}(x, u)
            r ← ρ_M(x, u, y)

            {Update the Q-value in (x, u) by using FSSS algorithm}
            Q(x, u) ← Q(x, u) + (1/C) [r + γ FSSS(y, d, t)]
        end for
    end for

    {Return the action u with the maximal Q-value in x}
    return argmax_u Q(x, u)
end function


Algorithm 7 FSSS (1/2)

function FSSS(x, d, t)
    {Develop a MCTS and compute bounds on V(x)}
    for 1 ≤ i ≤ t do
        ROLLOUT(x, d, 0)
    end for

    {Make an optimistic estimation of V(x)}
    V(x) ← max_u U_d(x, u)
    return V(x)
end function
Algorithm 8 FSSS (2/2)

procedure ROLLOUT(x, d, l)
    if d = l then {Stop when reaching the maximal depth}
        return
    end if

    if ¬Visited_d(x) then {New node reached}
        {Initialise this node}
        for all u ∈ U do
            "Initialise N_d(x, u, x'), R_d(x, u)"
            for 1 ≤ i ≤ C do
                "Sample x', r from M"
                "Update N_d(x, u, x'), R_d(x, u)"

                if ¬Visited_d(x') then
                    U_{d+1}(x'), L_{d+1}(x') = V_max, V_min
                end if
            end for
        end for

        {Back-propagate this node's information}
        BELLMAN-BACKUP(x, d)

        Visited_d(x) ← true
    end if

    {Select an action and simulate a transition optimistically}
    u ← argmax_u U_d(x, u)
    x' ← argmax_{x'} (U_{d+1}(x') - L_{d+1}(x')) N_d(x, u, x')

    {Continue the rollout process and back-propagate the result}
    ROLLOUT(x', d, l + 1)
    BELLMAN-BACKUP(x, d)
    return
end procedure
Algorithm 9 SBOSS (1/2)

function SEARCH(x, h)
    {Compute the transition matrix of the mean MDP of the posterior}
    M_mean ← "Compute the mean MDP of the current posterior"

    {Update the policy to follow if necessary}
    Δ(x, u) = Σ_y |P_t(x, u, y) - P_lastUpdate(x, u, y)| / σ(x, u, y)
    if t = 1 or ∃(x', u') : Δ(x', u') > δ then
        {Sample some transition vectors for each state-action pair}
        S ← {}
        for all (x, u) ∈ X × U do
            {Compute the number of transition vectors to sample for (x, u)}
            K_t(x, u) ← max_y ⌈σ²(x, u, y) / ε⌉

            {Sample K_t(x, u) transition vectors from < x,u >, sampled from
            the posterior}
            for 1 ≤ k ≤ K_t(x, u) do
                S ← S ∪ "A transition vector from < x,u >, sampled from the
                         posterior"
            end for
        end for

        M# ← "Build a new MDP by merging all transitions from S"
        π*_{M#} ← VALUE-ITERATION(M#)
        π_SBOSS ← FIT-ACTION-SPACE(π*_{M#})
        P_lastUpdate ← P_t
    end if

    {Return the optimal action in x w.r.t. π_SBOSS}
    return u ~ π_SBOSS(x)
end function
Algorithm 10 SBOSS (2/2)

function FIT-ACTION-SPACE(π#)
    for all x ∈ X do
        π(x) ← π#(x) mod |U|
    end for

    return π
end function

procedure ONLINE-LEARNING(x, u, y, r)
    "Update the posterior w.r.t. transition < x, u, y, r >"
end procedure


Algorithm 11 BEB

procedure SEARCH(x, h)
    M ← "Compute the mean MDP of the current posterior"

    {Add a bonus reward to all transitions}
    for < x,u,y > ∈ X × U × X do
        ρ_M(x, u, y) ← ρ_M(x, u, y) + β / c_{<x,u,y>}
    end for

    {Compute the optimal policy of the modified MDP}
    π*_M ← VALUE-ITERATION(M)

    {Return the optimal action in x w.r.t. π*_M}
    return u ~ π*_M(x)
end procedure

procedure ONLINE-LEARNING(x, u, y, r)
    "Update the posterior w.r.t. transition < x, u, y, r >"
end procedure


Appendix B. MDP distributions in detail 

In this section, we describe the MDPs drawn from the considered distributions in more
detail. In addition, we also provide a formal description of the corresponding θ
(parameterising the FDM used to draw the transition matrix) and $\rho_M$ (the reward function).

B.l Generalised Chain distribution 

On those MDPs, we can identify two possibly optimal behaviours: 

• The agent tries to move along the chain, reaches the last state, and collects as many
rewards as possible before returning to State 1;

• The agent gives up on reaching State 5 and tries to return to State 1 as often as possible.

B.1.1 Formal description

$X = \{1, 2, 3, 4, 5\}, \quad U = \{1, 2, 3\}$

$\forall u \in U$:

$\theta^{GC}_{1,u} = [1, 1, 0, 0, 0]$

$\theta^{GC}_{2,u} = [1, 0, 1, 0, 0]$

$\theta^{GC}_{3,u} = [1, 0, 0, 1, 0]$

$\theta^{GC}_{4,u} = [1, 0, 0, 0, 1]$

$\theta^{GC}_{5,u} = [1, 1, 0, 0, 1]$

$\forall (x, u) \in X \times U$:

$\rho^{GC}(x, u, 1) = 2.0$

$\rho^{GC}(x, u, 5) = 10.0$

$\rho^{GC}(x, u, y) = 0.0, \quad \forall y \in X \setminus \{1, 5\}$

B.2 Generalised Double-Loop distribution

Similarly to the GC distribution, we can also identify two possibly optimal behaviours:

• The agent enters the "good" loop and tries to stay in it until the end;

• The agent gives up and chooses to enter the "bad" loop as frequently as possible.

B.2.1 Formal description

$X = \{1, 2, 3, 4, 5, 6, 7, 8, 9\}, \quad U = \{1, 2\}$

$\forall u \in U$:

$\theta^{GDL}_{1,u} = [0, 1, 0, 0, 0, 1, 0, 0, 0]$

$\theta^{GDL}_{2,u} = [0, 0, 1, 0, 0, 0, 0, 0, 0]$

$\theta^{GDL}_{3,u} = [0, 0, 0, 1, 0, 0, 0, 0, 0]$

$\theta^{GDL}_{4,u} = [0, 0, 0, 0, 1, 0, 0, 0, 0]$

$\theta^{GDL}_{5,u} = [1, 0, 0, 0, 0, 0, 0, 0, 0]$

$\theta^{GDL}_{6,u} = [1, 0, 0, 0, 0, 0, 1, 0, 0]$

$\theta^{GDL}_{7,u} = [1, 0, 0, 0, 0, 0, 0, 1, 0]$

$\theta^{GDL}_{8,u} = [1, 0, 0, 0, 0, 0, 0, 0, 1]$

$\theta^{GDL}_{9,u} = [1, 0, 0, 0, 0, 0, 0, 0, 0]$

$\forall u \in U$:

$\rho^{GDL}(5, u, 1) = 1.0$

$\rho^{GDL}(9, u, 1) = 2.0$

$\rho^{GDL}(x, u, y) = 0.0, \quad \forall x \in X, \forall y \in X : y \neq 1$

B.3 Grid distribution

MDPs drawn from the Grid distribution are 2-dimensional grids. Since the agents considered
do not manage multi-dimensional state spaces, the following bijection was defined:

$n : \{1, 2, 3, 4, 5\} \times \{1, 2, 3, 4, 5\} \to X = \{1, 2, \ldots, 25\} : n(i, j) = 5(i - 1) + j$

where i and j are the row and column indexes of the cell on which the agent is.

When the agent reaches the G cell (in (5,5)), it is directly moved to (1,1), and
perceives its reward of 10. As a consequence, State (5, 5) is not reachable.

To move inside the Grid, the agent can perform four actions: U = {up, down, left, right}.
Those actions only move the agent to one adjacent cell. However, each action has a certain
probability to fail (depending on the cell on which the agent is). In case of failure, the
agent does not move at all. Besides, if the agent tries to move out of the grid, it will not
move either. Discovering a reliable (and short) path to reach the G cell will determine the
success of the agent.

B.3.1 Formal description

$X = \{1, 2, \ldots, 25\}, \quad U = \{up, down, left, right\}$

$\forall (i, j) \in \{1, 2, 3, 4, 5\} \times \{1, 2, 3, 4, 5\}$,
$\forall (k, l) \in \{1, 2, 3, 4, 5\} \times \{1, 2, 3, 4, 5\}$:

$\theta^{Grid}_{n(i,j),u}(n(i, j)) = 1, \quad \forall u \in U$

$\theta^{Grid}_{n(i,j),up}(n(i - 1, j)) = 1, \quad (i - 1) \geq 1$

$\theta^{Grid}_{n(i,j),down}(n(i + 1, j)) = 1, \quad (i + 1) \leq 5, \ (i, j) \neq (4, 5)$

$\theta^{Grid}_{n(i,j),left}(n(i, j - 1)) = 1, \quad (j - 1) \geq 1$

$\theta^{Grid}_{n(i,j),right}(n(i, j + 1)) = 1, \quad (j + 1) \leq 5, \ (i, j) \neq (5, 4)$

$\theta^{Grid}_{n(4,5),down}(n(1, 1)) = 1$

$\theta^{Grid}_{n(5,4),right}(n(1, 1)) = 1$

$\theta^{Grid}_{n(i,j),u}(n(k, l)) = 0$, else

$\rho^{Grid}((4, 5), down, (1, 1)) = 10.0$

$\rho^{Grid}((5, 4), right, (1, 1)) = 10.0$

$\rho^{Grid}((i, j), u, (k, l)) = 0.0, \quad \forall u \in U$, else

Appendix C. Paired sampled Z-test

Let $\pi_A$ and $\pi_B$ be the two agents we want to compare. We played the two agents on the
same N MDPs, denoted by $M_1, \ldots, M_N$. Let $R^{\pi_A}_{M_i}$ and $R^{\pi_B}_{M_i}$ be the scores we observed for
the two agents on $M_i$.
Step 1 - Hypothesis

We compute the mean and the standard deviation of the differences between the two sample
sets, denoted by $\bar{x}_d$ and $s_d$, respectively.

$$\bar{x}_d = \frac{1}{N} \sum_{i=1}^{N} \left( R^{\pi_A}_{M_i} - R^{\pi_B}_{M_i} \right)$$

$$s_d = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \bar{x}_d - \left( R^{\pi_A}_{M_i} - R^{\pi_B}_{M_i} \right) \right)^2}$$

If N > 30, $s_d$ is a good estimation of $\sigma_d$, the standard deviation of the differences between
the two populations ($s_d \approx \sigma_d$). In other words, $\sigma_d$ is the standard deviation we should
observe when testing the two algorithms on a number of MDPs tending towards infinity.
This was always the case in our experiments.

We now set Hypothesis $H_0$ and Hypothesis $H_a$:

$H_0 : \mu_d = 0$

$H_a : \mu_d > 0$

Our goal is to determine if $\mu_d$, the mean of the differences between the two populations, is
equal to or greater than 0. More precisely, we want to know if the difference between the two
agents' performances is significant ($H_a$ is correct) or not ($H_0$ is correct). Only one of those
hypotheses can be true.

Step 2 - Test statistic

The test statistic consists in computing a certain value Z:

$$Z = \frac{\bar{x}_d}{\sigma_d / \sqrt{N}}$$

This value will help us to determine if we should accept (or reject) hypothesis $H_a$.

Step 3 - Rejection region

Assuming we want our decision to be correct with a probability of failure of α, we will
have to compare Z with $Z_\alpha$, a value of a Gaussian curve. If $Z > Z_\alpha$, it means we are in
the rejection region (R.R.) with a probability equal to 1 - α. For a confidence of 95%, $Z_\alpha$
should be equal to 1.645.

Being in the R.R. means we have to reject Hypothesis $H_0$ (and accept Hypothesis $H_a$).
In the other case, we have to accept Hypothesis $H_0$ (and reject Hypothesis $H_a$).

Step 4 - Decision

At this point, we have either accepted Hypothesis $H_0$ or Hypothesis $H_a$.

• Accepting Hypothesis $H_0$ ($Z < Z_\alpha$): The two algorithms $\pi_A$ and $\pi_B$ are not
significantly different.

• Accepting Hypothesis $H_a$ ($Z > Z_\alpha$): The two algorithms $\pi_A$ and $\pi_B$ are
significantly different. Therefore, the algorithm with the greatest mean can be considered
better with 95% confidence.
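The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the sample standard deviation is computed with a 1/N normalisation and used in place of the population standard deviation (which the text justifies for N > 30), and the function name `paired_z_test` is hypothetical.

```python
import math


def paired_z_test(scores_a, scores_b, z_alpha=1.645):
    """One-sided paired Z-test on the scores of two agents over the same MDPs.

    Returns (Z, significant), where `significant` is True when Z > z_alpha,
    i.e. when agent A is significantly better than agent B.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both agents must be played on the same MDPs")
    n = len(scores_a)

    # Step 1: mean and standard deviation of the paired differences.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_d = sum(diffs) / n
    s_d = math.sqrt(sum((mean_d - d) ** 2 for d in diffs) / n)
    if s_d == 0.0:
        raise ValueError("all differences are identical; Z is undefined")

    # Step 2: test statistic.
    z = mean_d / (s_d / math.sqrt(n))

    # Steps 3 and 4: compare with the threshold of the rejection region.
    return z, z > z_alpha


scores_a = [1.0 + 0.01 * (i % 5) for i in range(50)]
scores_b = [0.5] * 50
print(paired_z_test(scores_a, scores_b))
```

The default threshold of 1.645 corresponds to the 95% confidence level used in the text; another confidence level would only require a different value of `z_alpha`.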